The Bay Area is notorious for its high cost of living, beginning with the absurd amount of money we pay for rent. As a result, people have spread out to surrounding cities like Gilroy, accepting longer commutes to avoid these costs. Zillow and HERE examined 34 cities across the nation to estimate the real-estate cost of commuting, using Zillow data to determine how much you could save on rent by living 15 minutes farther away. In places like Boston, that 15-minute move equated to housing that was 13% less expensive on average. Another study used an empirical model to examine the trade-off individuals face among wages, housing prices, and commuting costs. This balance between time and money, spent and saved, shows up as population shifts toward rural areas.

To start this project, I mapped commute times for the Bay Area and the counties surrounding it. This included the nine Bay Area counties along with Santa Cruz, San Benito, San Joaquin, Stanislaus, Monterey, Sacramento, Yolo, Merced, Fresno, and San Luis Obispo. The PUMS data for these commute times was so large that I took a random sample of 20,000 rows to map and model for this project. Here are the leaflet maps of commute times by county for each year (2016, 2017, 2018, 2019).
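The random sampling step mentioned above could look something like the following in base R. This is only a sketch: the `full` data frame is a simulated stand-in for the PUMS extract, and the `row_id` column and the seed value are assumptions made for illustration.

```r
# Simulated stand-in for the full PUMS extract (not real data).
set.seed(2019)  # arbitrary seed, assumed here so the sample is reproducible
full <- data.frame(
  row_id = seq_len(100000),
  JWMNP  = sample(1:142, 100000, replace = TRUE)  # commute time in minutes
)

# Draw 20,000 distinct rows at random (sample() is without replacement by default).
sampled <- full[sample(nrow(full), 20000), ]
nrow(sampled)
```

Fixing the seed means the same 20,000 rows come back on every run, which keeps the maps and models comparable between sessions.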

In addition, to see whether commute times differed over the years among all the counties, I drew a density plot for each year, as shown below. We can see that most commute times are under 50 minutes, but they can be as long as 142 minutes. Of the rows with a 142-minute commute, many come from San Joaquin County and fewer from Bay Area counties. From the histogram of transportation methods to work, we can see that most people get to work by car, truck, or van.

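I wanted to see if there was a way to predict someone’s commute time based on their income, rent, and access to the internet and a smartphone. This was done using a logit model: commute time (JWMNP) was divided by the 142-minute maximum so that it falls between 0 and 1, then fit with a quasibinomial GLM. The results are shown below.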

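The first line of output below is each coefficient passed through the inverse logit. Every predictor's value is close to 0.5, which corresponds to a coefficient close to 0 on the logit scale, so a one-unit change in any of these factors barely moves the predicted commute time. Income (PINCP) is the most strongly significant predictor, but its per-dollar effect is tiny, and the residual deviance (3615.0) is barely below the null deviance (3653.1). We can get a fitted value for a specific case using the predict function. For a rent of 1000 dollars, a smartphone, an income of 60,000 dollars, and access to the internet, the fitted value is about 0.22. Because the response is commute time divided by 142, this is a predicted commute of about 22% of the 142-minute maximum, or roughly 31 minutes, rather than the probability of an event.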

## (Intercept)        RNTP  SMARTPHONE       PINCP      ACCESS 
##   0.1909670   0.4999984   0.5192961   0.5000002   0.5130566
## 
## Call:
## glm(formula = JWMNP/142 ~ RNTP + SMARTPHONE + PINCP + ACCESS, 
##     family = quasibinomial(), data = sample_bay_pums_com_2019)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8723  -0.3019  -0.1128   0.1589   1.8187  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.444e+00  4.065e-02 -35.516   <2e-16 ***
## RNTP        -6.227e-06  6.769e-06  -0.920   0.3576    
## SMARTPHONE   7.722e-02  4.011e-02   1.925   0.0542 .  
## PINCP        9.271e-07  6.590e-08  14.068   <2e-16 ***
## ACCESS       5.224e-02  2.386e-02   2.190   0.0286 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.1881401)
## 
##     Null deviance: 3653.1  on 19999  degrees of freedom
## Residual deviance: 3615.0  on 19995  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
##         1 
## 0.2201353
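As a rough check on the numbers above, the fitted value can be reproduced by hand from the printed coefficients. This sketch assumes SMARTPHONE and ACCESS are both coded 1 for the example case; because the coefficients are copied from the rounded summary output, the result only matches the reported 0.2201353 to about three decimal places.

```r
# Coefficients copied from the summary output above (logit scale, rounded).
coefs <- c("(Intercept)" = -1.444, RNTP = -6.227e-06, SMARTPHONE = 7.722e-02,
           PINCP = 9.271e-07, ACCESS = 5.224e-02)

# plogis() is the inverse logit, 1 / (1 + exp(-x)). Applied to each coefficient
# it gives values close to the first line of the output above.
round(plogis(coefs), 4)

# Example case: intercept, rent 1000, smartphone = 1, income 60000, access = 1.
x <- c(1, 1000, 1, 60000, 1)
eta <- sum(coefs * x)   # linear predictor on the logit scale
p <- plogis(eta)        # fitted fraction of the 142-minute maximum
minutes <- p * 142      # back on the original scale of JWMNP

c(fraction = p, minutes = minutes)
```

Multiplying the fitted fraction by 142 puts the prediction back in minutes, which is easier to compare against the density plots than a value between 0 and 1.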

To narrow it down to two factors, I fit a linear model of commute time on income. The income coefficient is statistically significant, but the R-squared is extremely low (about 0.01), so income explains roughly 1% of the variation in commute time and the linear relationship between the two is very weak.

## 
## Call:
## lm(formula = JWMNP ~ PINCP, data = sample_bay_pums_com_2019)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.823 -16.391  -6.459   9.621 112.226 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.990e+01  2.380e-01  125.66   <2e-16 ***
## PINCP       2.485e-05  1.758e-06   14.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.74 on 19998 degrees of freedom
## Multiple R-squared:  0.009891,   Adjusted R-squared:  0.009841 
## F-statistic: 199.8 on 1 and 19998 DF,  p-value: < 2.2e-16
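To put the output above in more familiar terms, the short sketch below converts the R-squared into a correlation and the slope into minutes per $100,000 of income. It relies on the fact that in a one-predictor linear model the R-squared equals the squared correlation; the variable names are just labels chosen for this example.

```r
r_squared <- 0.009891  # Multiple R-squared from the output above
slope     <- 2.485e-05 # PINCP estimate: minutes per additional dollar of income

# With a single predictor, R-squared is the squared correlation,
# and the correlation takes the sign of the slope.
r <- sign(slope) * sqrt(r_squared)

# Predicted change in commute time for $100,000 more income.
extra_minutes <- slope * 100000

c(correlation = r, extra_minutes = extra_minutes)
```

A correlation of about 0.1 and a difference of about 2.5 minutes per $100,000 show how a coefficient can be highly significant in a sample of 20,000 rows while still having little practical weight.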

I also tried to model commute time linearly as a function of how much people pay in rent. This fit was also very weak. As a note, I kept only rows with rent greater than $0, since people who pay no rent may have purchased their home.
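The filter-then-fit step described above might be sketched as follows. The `toy` data frame is simulated purely for illustration and is not the PUMS sample, so its R-squared says nothing about the real data; only the column names RNTP and JWMNP come from the models earlier in this post.

```r
# Simulated stand-in for the PUMS sample (not real data).
set.seed(1)
toy <- data.frame(
  RNTP  = c(rep(0, 50), round(runif(150, 500, 4000))),  # 50 rows with no rent
  JWMNP = round(runif(200, 1, 142))                      # commute time in minutes
)

# Keep renters only; subset() also drops rows where RNTP is NA.
renters <- subset(toy, RNTP > 0)

# Linear model of commute time on monthly rent.
fit_rent <- lm(JWMNP ~ RNTP, data = renters)
summary(fit_rent)$r.squared
```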

Lastly, to really see the lack of correlation, I plotted a few factors below.